SNAC: Speaker-Normalized Affine Coupling Layer in Flow-Based Architecture for Zero-Shot Multi-Speaker Text-to-Speech
Authors
Abstract
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a speech sample with the voice characteristic of an unseen speaker. The main challenge of ZSM-TTS is to increase the overall speaker similarity for unseen speakers. One of the most successful speaker conditioning methods for flow-based multi-speaker text-to-speech (TTS) models is to utilize functions which predict the scale and bias parameters of the affine coupling layers according to the given speaker embedding vector. In this letter, we improve on the previous speaker conditioning method by introducing a speaker-normalized affine coupling (SNAC) layer, which allows unseen-speaker speech synthesis in a zero-shot manner by leveraging a normalization-based conditioning technique. The newly designed coupling layer explicitly normalizes the input by the parameters predicted from a speaker embedding vector while training, enabling an inverse process of denormalizing for a new speaker embedding at inference. The proposed conditioning scheme yields state-of-the-art performance in terms of speech quality and speaker similarity in a ZSM-TTS setting.
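The normalize-while-training, denormalize-at-inference idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the exact parameterization, so the linear predictors (W_mean, W_log_scale, W_s, W_b), the tensor sizes, and the function names below are all hypothetical and chosen only to show that the two directions are exact inverses of each other.

```python
import numpy as np

rng = np.random.default_rng(0)

EMB_DIM, CHANNELS = 8, 4   # hypothetical sizes, not taken from the paper
HALF = CHANNELS // 2

# Hypothetical linear predictors: speaker embedding -> per-channel mean and log-scale.
W_mean = rng.normal(scale=0.1, size=(EMB_DIM, CHANNELS))
W_log_scale = rng.normal(scale=0.1, size=(EMB_DIM, CHANNELS))

# Hypothetical linear predictors for the coupling's own scale and bias.
W_s = rng.normal(scale=0.1, size=(HALF, HALF))
W_b = rng.normal(scale=0.1, size=(HALF, HALF))


def speaker_params(emb):
    """Predict normalization parameters from a speaker embedding."""
    return emb @ W_mean, emb @ W_log_scale


def speaker_normalize(x, emb):
    """Remove speaker-dependent mean and scale from the input."""
    mean, log_scale = speaker_params(emb)
    return (x - mean) * np.exp(-log_scale)


def speaker_denormalize(h, emb):
    """Exact inverse of speaker_normalize for the same embedding."""
    mean, log_scale = speaker_params(emb)
    return h * np.exp(log_scale) + mean


def coupling_forward(x, emb):
    """Training direction: normalize by speaker, then apply an affine coupling."""
    h = speaker_normalize(x, emb)
    h_a, h_b = h[:, :HALF], h[:, HALF:]
    log_s, b = np.tanh(h_a @ W_s), h_a @ W_b
    return np.concatenate([h_a, h_b * np.exp(log_s) + b], axis=1)


def coupling_inverse(z, emb):
    """Inference direction: undo the coupling, then denormalize for a speaker."""
    z_a, z_b = z[:, :HALF], z[:, HALF:]
    log_s, b = np.tanh(z_a @ W_s), z_a @ W_b
    h = np.concatenate([z_a, (z_b - b) * np.exp(-log_s)], axis=1)
    return speaker_denormalize(h, emb)


x = rng.normal(size=(5, CHANNELS))      # 5 frames of toy features
emb_train = rng.normal(size=EMB_DIM)    # embedding used in the forward pass
emb_new = rng.normal(size=EMB_DIM)      # stand-in for an unseen speaker

z = coupling_forward(x, emb_train)
x_rec = coupling_inverse(z, emb_train)  # same speaker: recovers x
x_new = coupling_inverse(z, emb_new)    # new speaker: different output
```

In this sketch the coupling's scale and bias see only speaker-normalized features, so speaker identity enters solely through the normalization statistics; swapping the embedding in the inverse pass is what changes the output.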
Similar resources
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. As a starting point, we show improvements over the two state-of-the-art approaches for single-speaker neural TTS: Deep Voice 1 and Tacotron. We introduce Deep Voice 2, which is based on a similar pipeline with Deep Voice 1, but con...
Speech detection for text-dependent speaker verification
The performance of text-dependent speaker verification systems degrades in noisy environments and when the true speaker utters words that are not part of the verification password. Energy-based voice activity detection (VAD) algorithms cannot distinguish between the true speaker’s speech and other background speech or between the speaker’s verification password and other words uttered by the spe...
Speaker-Dependent Dictionary-Based Speech Enhancement for Text-Dependent Speaker Verification
The problem of text-dependent speaker verification under noisy conditions is becoming ever more relevant, due to increased usage for authentication in real-world applications. Classical methods for noise reduction such as spectral subtraction and Wiener filtering introduce distortion and do not perform well in this setting. In this work we compare the performance of different noise reduction me...
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...
Speaker normalized spectral subband parameters for noise robust speech recognition
This paper proposes speaker normalized spectral subband centroids (SSCs) as supplementary features in noise environment speech recognition. SSCs are computed as frequency centroids for each subband from the power spectrum of the speech signal. Since the conventional SSCs depend on formant frequencies of a speaker, we introduce a speaker normalization technique into SSC computation to reduce the...
Journal
Journal title: IEEE Signal Processing Letters
Year: 2022
ISSN: 1558-2361, 1070-9908
DOI: https://doi.org/10.1109/lsp.2022.3226655